Efficient passage retrieval using document metadata

ABSTRACT

A system, method and computer program product for efficiently retrieving passages relevant to questions based on a corpus of data. A processor device receives an input query and performs a query analysis to obtain searchable query terms. The processor device then matches metadata associated with one or more documents against the query terms. The document metadata includes one or more of: a title of the documents, or one or more user tags or clouds. The processor device then maps matched document metadata to corresponding one or more documents; identifies the corresponding matched documents to form a subcorpus of documents; and conducts a search in the subcorpus using the searchable query terms to obtain one or more passages relevant to the input query from the identified documents.

CROSS-REFERENCE TO RELATED APPLICATIONS

The present invention relates to and claims the benefit of the filing date of commonly-owned, co-pending U.S. patent application Ser. No. 13/244,347 filed Sep. 24, 2011 which claims the benefit of the filing date of U.S. Provisional Patent Application No. 61/386,019, filed Sep. 24, 2010, the entire contents and disclosure of which is incorporated by reference as if fully set forth herein.

BACKGROUND

The invention relates generally to information retrieval systems, and more particularly, the invention relates to an automated query/answer system and method implementing a passage retrieval component to conduct a search that identifies passages relevant to a given question using document metadata from a collection including text-based resources.

DESCRIPTION OF THE RELATED ART

An introduction to the current issues and approaches of question answering (QA) can be found in the web-based reference http://en.wikipedia.org/wiki/Question_answering. Generally, QA is a type of information retrieval. Given a collection of documents (such as the World Wide Web or a local collection) the system should be able to retrieve answers to questions posed in natural language. QA is regarded as requiring more complex natural language processing (NLP) techniques than other types of information retrieval such as document retrieval, and it is sometimes regarded as the next step beyond search engines.

QA research attempts to deal with a wide range of question types including: fact, list, definition, How, Why, hypothetical, semantically-constrained, and cross-lingual questions. Search collections vary from small local document collections, to internal organization documents, to compiled newswire reports, to the World Wide Web.

Closed-domain QA deals with questions under a specific domain, for example medicine or automotive maintenance, and can be seen as an easier task because NLP systems can exploit domain-specific knowledge frequently formalized in ontologies. Open-domain QA deals with questions about nearly everything, and can only rely on general ontologies and world knowledge. On the other hand, these systems usually have much more data available from which to extract the answer.

Alternatively, closed-domain QA might refer to a situation where only limited types of questions are accepted, such as questions asking for descriptive rather than procedural information.

Access to information is currently dominated by two paradigms. First, a database query that answers questions about what is in a collection of structured records. Second, a search that delivers a collection of document links in response to a query against a collection of unstructured data, for example, text or html.

A major unsolved problem in such information query paradigms is the lack of a computer program capable of accurately answering factual questions based on information included in a collection of documents that can be either structured, unstructured, or both. Such factual questions can be either broad, such as “what are the risks of vitamin K deficiency?”, or narrow, such as “when and where was Hillary Clinton's father born?”

It is a challenge to understand the query, to find appropriate documents that might contain the answer, and to extract the correct answer to be delivered to the user. There is a need to further advance the methodologies for answering open-domain questions.

SUMMARY

In one aspect there is provided a computing infrastructure and methodology that conducts question and answering and performs automatic passage retrieval operations in a highly efficient manner.

In one aspect, there is provided a computer-implemented method for efficiently retrieving relevant passages to questions based on a corpus of data comprising: receiving an input query; performing a query context analysis upon the input query to obtain searchable query terms; matching metadata associated with one or more documents against the query terms; mapping matched document metadata to corresponding one or more documents; identifying corresponding matched documents to form a subcorpus of documents; and conducting a search in the data subcorpus using the searchable query terms to obtain one or more passages relevant to the input query from the identified documents, wherein one or more processor devices performs one or more of the receiving, performing, matching, mapping, identifying and conducting.

In this aspect, the document metadata includes one or more of: a title of the documents, one or more user tags, one or more automatically identified document labels.

Further to this aspect, prior to matching of metadata associated with one or more documents against the query terms there is performed: extracting document metadata from one or more documents of a corpus of documents; providing the extracted document metadata as a dictionary in a storage device, each document metadata stored in the dictionary being associated with a corresponding document identification (ID), wherein the matching of metadata against the query terms comprises: performing, by the processor device, a dictionary matching.

In an alternate embodiment, there is provided a computer-implemented method for efficiently retrieving relevant passages to questions based on a corpus of data comprising: receiving, at a processor device, an input query; performing, at the processor device, a query context analysis upon the input query to obtain searchable query terms; accessing a dictionary of document metadata obtained from one or more documents of the data corpus, each stored document metadata being associated with a corresponding document identification (ID); performing, by the processor device, a dictionary matching of the metadata associated with one or more documents against the query terms; mapping matched document metadata to corresponding one or more document IDs; identifying corresponding matched documents to form a subcorpus of documents; and conducting a search in the subcorpus using the searchable query terms to obtain passages relevant to the input query from the identified documents.

A computer program product is provided for performing operations. The computer program product includes a storage medium readable by a processing circuit and storing instructions run by the processing circuit for running a method(s). The method(s) are the same as listed above.

BRIEF DESCRIPTION OF THE DRAWINGS

The objects, features and advantages of the invention are understood within the context of the Detailed Description, as set forth below. The Detailed Description is understood within the context of the accompanying drawings, which form a material part of this disclosure, wherein:

FIG. 1 shows a prior art high level logical architecture 10 of a question/answering method in which the present invention may be employed;

FIG. 2 is a schematic diagram depicting passage retrieval components 75 according to one embodiment;

FIG. 3 is a flow diagram illustrating a method 100 for performing passage retrieval operations in one embodiment; and,

FIG. 4 illustrates an exemplary hardware configuration to run method steps described in FIG. 3 in one embodiment.

DETAILED DESCRIPTION

FIG. 1 shows a QA system diagram such as described in U.S. patent application Ser. No. 12/126,642 depicting a high-level logical architecture 10 and methodology in which the present system and method may be employed in one embodiment.

FIG. 1 illustrates the major components that comprise a canonical question answering system 10 and their workflow. The question analysis component 20 receives a natural language question 19 (e.g., “Who is the 42nd president of the United States?”) and analyzes the question to produce, minimally, the semantic type of the expected answer (in this example, “president”), and optionally other analysis results for downstream processing. The search component 30 a formulates queries from the output 29 of question analysis and consults various resources such as the World Wide Web 41 or one or more knowledge resources, e.g., databases, knowledge bases 42, to retrieve “documents” including, e.g., whole documents or document portions 44, e.g., web-pages, database tuples, etc., having “passages” 44 that are relevant to answering the question. The candidate answer generation component 30 b may then extract from the search results 48 potential (candidate) answers to the question, which are then scored and ranked by the answer selection component 50 to produce a final ranked list of answers with associated confidence scores.

In current question and answer systems, one key component is the passage retrieval operations conducted when searching for candidate answers in a heterogeneous collection of structured, semi-structured and unstructured information resources. Passage retrieval operations adapt a search engine at its core to identify passages relevant to a given question from the collection of sources, e.g., text-based sources. Passage retrieval is also relevant to any search application where selecting passages containing, for example, 1-3 sentences is more appropriate than retrieving entire documents either for processing by downstream components, or for presentation to the end user.

Most existing systems performing a passage retrieval operation adopt one of two approaches. The first approach is to adopt a document search engine to retrieve a list of relevant documents using the search engine's internal document ranking criteria, and to apply a custom post-hoc passage scoring algorithm to identify the most relevant text segments from these documents. The second approach is to adopt a search engine with passage retrieval capability and to make use of the engine's internal ranking algorithm to return a set of relevant passages. In either approach, the retrieval process is performed over the entire collection, which typically contains millions of documents or more. This poses an efficiency issue for real-time question answering systems that must deliver answers to users in no more than a few seconds. A typical solution for this problem is to split the search index into multiple subindices on multiple machines so that retrieval against the subindices can be performed in parallel and their results merged. While this solution addresses the efficiency issue, it poses other problems related to merging search results from multiple indices.

It would be highly desirable to provide a system and method that improves the efficiency of passage retrieval based on dynamic subcorpus selection to constrain the number of relevant documents considered in the retrieval process.

In one embodiment, the present system and method for efficient passage retrieval against a corpus given a question is applicable and may be part of a Question Answering (QA) system. Alternatively, the system and method for efficient passage retrieval against a corpus given a question may be implemented in non-QA applications, i.e., applications implemented to return a passage, for example, a 1-sentence to 3-sentence passage most relevant to a question, as opposed to an answer per se.

Commonly-owned, co-pending U.S. patent application Ser. No. 12/126,642, titled “SYSTEM AND METHOD FOR PROVIDING QUESTION AND ANSWERS WITH DEFERRED TYPE EVALUATION” and co-pending U.S. patent application Ser. No. 12/152,411, titled “SYSTEM AND METHOD FOR PROVIDING ANSWERS TO QUESTIONS” are both incorporated by reference herein, and describe a QA (Question and Answer) system and method in which the present passage retrieval system may be incorporated.

In one embodiment, the present disclosure may extend and complement the effectiveness of a QA or non-QA system and method by improving the efficiency of passage retrieval operations based on dynamic subcorpus selection to constrain the number of relevant documents considered in the retrieval process.

In one embodiment, the subcorpus selection process is based on a matching algorithm that identifies relevant documents based on the question text and metadata associated with the documents in the collection, such as document titles, user tags (“clouds”), or automatically identified document labels. The passage retrieval process is then restricted to return passages only from this subcorpus, which typically contains several orders of magnitude fewer documents than the entire collection.

The approach to efficient passage retrieval significantly constrains the pool of documents from which passages may be retrieved based on metadata associated with documents, such as document titles and user tags (“clouds”). The efficiency of passage retrieval is improved by providing the ability to dynamically select a subcorpus from which the search will take place based on terms in the user question and metadata associated with documents in the corpus. More specifically, the user's input question string is analyzed to extract all matches between question terms and document metadata. Those matched documents comprise a subcorpus from which the system will extract passages for this question.

In a non-limiting example, there is considered the following user question:

“which modern artist was Francoise Gilot, Dr. Jonas Salk's wife, once the companion of?”

In the example, matching the instances of document titles to the terms in the question yields five entities: “modern”, “artist”, “Francoise Gilot”, “Jonas Salk”, and “companion”, each of which is identified as a document title in the corpus. It is understood that a term may map to multiple documents with that title. For example, “companion” may map to an article that talks about a caregiver, or an architectural feature of ships, or a character in “Doctor Who”. Using the document identifications (IDs) that correspond to each document title, the documents with the identified document IDs are selected to form a subcorpus consisting of potentially highly relevant documents for answering the given question. The passage retrieval process is then constrained to finding the most relevant passages from this document subcorpus, which may contain on the order of tens of documents, instead of from the entire collection, which may contain millions of documents or more. In this example, several relevant passages are retrieved, such as “Francoise Gilot (born 1921) is a French born painter and is known as a companion of Picasso between 1944 and 1953” from the document titled “Francoise Gilot”, and “In 1968, they divorced, and in 1970 Salk married Francoise Gilot, the former mistress of Pablo Picasso” from the document titled “Jonas Salk”.
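The title matching in this example can be sketched in a few lines of Python. This is only an illustrative sketch and not the matcher of any embodiment: the dictionary contents, the document IDs (`d01` to `d07`), the helper names, and the greedy longest-match strategy are all assumptions introduced for illustration.

```python
import re

# Hypothetical title dictionary: lowercase title -> IDs of documents with that title.
TITLE_TO_DOC_IDS = {
    "modern": ["d01"],
    "artist": ["d02"],
    "francoise gilot": ["d03"],
    "jonas salk": ["d04"],
    "companion": ["d05", "d06", "d07"],  # caregiver, ship feature, "Doctor Who"
}


def tokenize(text):
    """Lowercase, drop possessive 's, and split into alphanumeric tokens."""
    text = re.sub(r"'s\b", "", text.lower())
    return re.findall(r"[a-z0-9]+", text)


def match_titles(question, dictionary, max_len=3):
    """Greedy longest-match lookup of question n-grams in the dictionary."""
    tokens = tokenize(question)
    matches, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in dictionary:
                matches.append(phrase)
                i += n
                break
        else:
            i += 1  # no dictionary entry starts at this token
    return matches


def select_subcorpus(matches, dictionary):
    """Union of the document IDs of all matched titles."""
    return sorted({doc_id for m in matches for doc_id in dictionary[m]})


QUESTION = ("which modern artist was Francoise Gilot, Dr. Jonas Salk's wife, "
            "once the companion of?")
matched = match_titles(QUESTION, TITLE_TO_DOC_IDS)
subcorpus_ids = select_subcorpus(matched, TITLE_TO_DOC_IDS)
```

With this toy dictionary, `matched` holds the five titles named above and `subcorpus_ids` holds seven document IDs, because the title “companion” maps to three documents.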

FIG. 2 is a schematic diagram depicting passage retrieval components 75 that may be implemented in QA and non-QA systems according to one embodiment. In one embodiment, the system components 75 conducting passage retrieval operations make use of system modules from FIG. 1 such as: the question analysis processing component 20 that performs a query context analysis upon a received input query to break down said input query into query terms, and any searchable components thereof; and, the search component 30 a that formulates queries from the output searchable components of the question analysis unit and that consults various resources such as the World Wide Web 41 or one or more knowledge resources, e.g., databases, knowledge bases 42.

More particularly, as shown in FIG. 2, the question analysis processing component 20 includes a programmed matcher component 80 that functions to identify document metadata present in the question. It performs this by consulting a resource 84 containing document metadata information for all documents. Document metadata may include any information that identifies the topic or domain of the document, such as the document title, manually or automatically derived category/domain classification, and crowdsourced or automatically derived tag clouds (“clouds”) which indicate general topics of the document. It is against this data resource 84 that matching of terms in the input question to the document metadata information is performed. Data corpus 89 represents the entire data corpus that the QA or non-QA system is using and may include both open domain and closed domain topics.

For example, a document containing George W. Bush's 2007 State of the Union address may include the following metadata:

Title: 2007 State of the Union Address

Category: Presidential Addresses, George W. Bush Speeches, . . .

Tags: Security, Iraq, Terrorists, Health, America, . . .

A sample implementation of this matcher component 80 is to represent the metadata in dictionary form and to leverage a dictionary matcher to identify dictionary terms that appear in an input question. For example, any matching component that can identify closed or open domain dictionary terms in text (e.g., legal terms, medical terms, or generic named entities) may be used. Thus, given a piece of text (an input query), the matching algorithm determines from the question text those terms that match entries in the dictionary. In one embodiment, a dictionary matcher includes the open source ConceptMapper annotator available at http://uima.apache.org/sandbox.html#concept.mapper.annotator, whose functionality is incorporated by reference as if fully set forth herein.

The matched dictionary entries (question terms) are used to identify a subset of documents for the passage retrieval process. That is, for the query terms that are mapped to the metadata (titles, tags, clouds) of a document in the resource 84, that document's index (or other document identifier) is flagged, tagged, or recorded for its inclusion in a subcorpus. In one embodiment, each dictionary entry in resource 84 encodes the document ID for each document that contains metadata matching that dictionary term. The metadata and associated document information in the dictionary entries that match the terms in the input question are represented as 85 in FIG. 2.

The passage retrieval component can be any standard IR (Information Retrieval) search engine 90 that supports both of: retrieval of relevant short passages, instead of full documents; and runtime specification of a relevant subcorpus for retrieval. One example IR search engine that satisfies these requirements is the Indri engine from the Lemur Toolkit, a search engine with passage retrieval capability (http://www.lemurproject.org/indri/), incorporated by reference as if fully set forth herein.

In further view of FIG. 2, the matched documents identified by the matcher component 80 form a constrained document set 88, indicated within the entire corpus 89 having the entire index, and a subcorpus 92 is built including the constrained document set 88, on which passage retrieval operations via IR search engine 90 are performed to select the most relevant passages.

A passage retrieval method 100 employed by the passage retrieval components 75 for improving the efficiency of passage retrieval is described with respect to FIG. 3. As shown in FIG. 3, the method 100 includes at 101, receiving at a processor device, an input query and, using a parser device or function, breaking down the query into searchable query terms. In one embodiment, the obtained searchable query terms from said input query are terms that match document metadata. Then, at 105, there is performed accessing a semi-structured source of information containing document metadata (such as the title of the documents, a category, or user tags or clouds). In one embodiment, the semi-structured source of information is a dictionary or corpus that associates data (e.g., definitions) with a large set of vocabulary items including document metadata stored in a memory storage device.

That is, in one embodiment, the semi-structured source of information may be formed via off-line processes that extract document metadata from one or more documents of a large corpus of documents. The extracted document metadata is stored as a dictionary in the memory storage device, with each document metadata stored in the dictionary having one or more associated document identifications (IDs) that represent those documents matching the metadata in that dictionary entry.
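As a rough illustration of such an off-line process, the following Python sketch builds a metadata dictionary in which every title, category and tag points to the IDs of the documents carrying it. The record layout, the field names (`id`, `title`, `categories`, `tags`), the document IDs and the second record are hypothetical; the first record reuses the State of the Union metadata shown earlier.

```python
from collections import defaultdict

# Hypothetical corpus records; only the metadata fields are used here.
DOCUMENTS = [
    {"id": "d10", "title": "2007 State of the Union Address",
     "categories": ["Presidential Addresses", "George W. Bush Speeches"],
     "tags": ["Security", "Iraq", "Terrorists", "Health", "America"]},
    {"id": "d11", "title": "Iraq",  # invented second record for illustration
     "categories": [], "tags": []},
]


def build_metadata_dictionary(documents):
    """Map each normalized metadata string to the sorted IDs of its documents."""
    index = defaultdict(set)
    for doc in documents:
        entries = [doc["title"], *doc.get("categories", []), *doc.get("tags", [])]
        for entry in entries:
            index[entry.strip().lower()].add(doc["id"])
    return {term: sorted(ids) for term, ids in index.items()}


METADATA_DICTIONARY = build_metadata_dictionary(DOCUMENTS)
```

In this sketch the entry `"iraq"` is associated with both documents, once through a tag and once through a title, which mirrors the statement that a dictionary entry may carry more than one document ID.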

Then, at 110, the programmed processor device invokes a matching component to match document metadata against the query terms. As mentioned, a dictionary matcher may be invoked that includes the open source ConceptMapper annotator available at http://uima.apache.org/sandbox.html#concept.mapper.annotator.

Continuing to 115, there is next performed mapping of the matched document metadata to corresponding one or more document IDs. Then at 120, from the corresponding IDs, there is performed identifying the corresponding matched documents.

In one embodiment, for the matched document metadata found in the dictionary, the corresponding documents indicated by the mapped document IDs are identified, e.g., flagged, tagged or recorded in the corpus in which the actual documents are electronically stored with their ID. Thus, in one embodiment, the identified corresponding matched documents form the subcorpus 92 of documents including only the identified matched metadata documents of the larger corpus of documents. This step invokes corpus construction functionality to identify the subset of flagged, tagged or otherwise identified matched metadata documents obtained from the first corpus 84 (FIG. 2) during the matching step, which functionality for dynamically constructing subcorpora during runtime is provided for example in the above-incorporated Indri engine from the Lemur Toolkit.

In an alternate embodiment, there may be further performed at 125, extracting the identified corresponding matched documents found in step 120 as the subcorpus 92.

Then, at 130, the method performs passage retrieval operations against those identified matched metadata documents obtained from the subcorpus 92 formed at step 120 or 125.

Finally, assuming a search engine has internal document ranking ability, then at 135, there is returned the resulting list of ranked passages.
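Steps 115 through 135 can be illustrated with the following Python sketch. It is a toy stand-in and not the Indri engine or any embodiment's retrieval logic: passages are single sentences, the score is a simple count of shared terms, and the corpus texts, document IDs and function names are assumptions made for illustration.

```python
import re

# Hypothetical metadata dictionary and corpus (document ID -> text).
DICTIONARY = {
    "francoise gilot": ["d03"],
    "jonas salk": ["d04"],
    "companion": ["d05"],
}
CORPUS = {
    "d03": "Francoise Gilot (born 1921) is a French born painter and is known "
           "as a companion of Picasso between 1944 and 1953.",
    "d04": "In 1968, they divorced, and in 1970 Salk married Francoise Gilot, "
           "the former mistress of Pablo Picasso.",
    "d05": "A companion may also refer to an architectural feature of ships.",
    "d99": "This text mentions Francoise Gilot but lies outside the subcorpus.",
}


def split_sentences(text):
    """Naive sentence splitter on terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def terms(text):
    """Set of lowercase alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_passages(query_terms, matched_metadata, dictionary, corpus, top_k=2):
    # Steps 115 and 120: map matched metadata to document IDs, identify documents.
    doc_ids = {d for m in matched_metadata for d in dictionary.get(m, [])}
    subcorpus = {d: corpus[d] for d in doc_ids if d in corpus}
    # Step 130: passage retrieval restricted to the subcorpus.
    scored = []
    for doc_id, text in subcorpus.items():
        for sentence in split_sentences(text):
            score = len(query_terms & terms(sentence))
            if score:
                scored.append((score, doc_id, sentence))
    # Step 135: return the ranked list (highest score first, ties by document ID).
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:top_k]


QUERY_TERMS = {"modern", "artist", "francoise", "gilot", "jonas", "salk", "companion"}
results = retrieve_passages(QUERY_TERMS, list(DICTIONARY), DICTIONARY, CORPUS)
```

Document `d99` is never scored even though it shares terms with the query, because its ID was not produced by the metadata match; this is the restriction that keeps the search confined to the subcorpus.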

In one embodiment, the passage retrieval process 100, FIG. 3, when performed in parallel with traditional passage retrieval algorithms, is more effective when the information sought in the question is present in documents whose relevant metadata field contains a term/phrase in the question. To increase recall, the dictionary can be constructed to include morphological variations for the given metadata information, such as including both the singular and plural forms of terms, as well as known synonyms. In one embodiment, redirect links between Wikipedia® titles (which, e.g., redirect requests for the document “artists” to the document titled “artist” and, for example, “Ol' Blue Eyes” to “Frank Sinatra”) are used to capture morphological variations and synonyms. Alternatively, morphological and synonym information can be mined from publicly available resources such as WordNet® (trademark of Princeton University) available at http://wordnet.princeton.edu/. For these questions, this approach significantly reduces execution time compared with performing passage retrieval against a large unconstrained corpus 89.
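One way to fold such redirects into the dictionary is sketched below in Python. The redirect table reuses the two examples given above; the base dictionary, the document IDs and the function name are hypothetical, and a real redirect table would be mined from the source collection rather than written by hand.

```python
# Hypothetical redirect table: alias title -> canonical title (both lowercase).
REDIRECTS = {
    "artists": "artist",
    "ol' blue eyes": "frank sinatra",
    "unknown alias": "missing title",  # target absent from the dictionary
}

# Hypothetical base dictionary: canonical title -> document IDs.
BASE_DICTIONARY = {"artist": ["d02"], "frank sinatra": ["d20"]}


def add_variants(dictionary, redirects):
    """Return a copy of the dictionary in which each alias shares its target's IDs."""
    expanded = {term: list(ids) for term, ids in dictionary.items()}
    for alias, target in redirects.items():
        if target in dictionary:
            merged = set(expanded.get(alias, [])) | set(dictionary[target])
            expanded[alias] = sorted(merged)
    return expanded


EXPANDED_DICTIONARY = add_variants(BASE_DICTIONARY, REDIRECTS)
```

After expansion, a question containing “artists” or “Ol' Blue Eyes” matches the same documents as the canonical titles, while aliases whose target is not in the dictionary are ignored.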

As mentioned, FIG. 1 shows a system diagram described in U.S. patent application Ser. No. 12/126,642 depicting a high-level logical architecture of a QA system 10 and methodology in which a system and method for deferred type evaluation using text with limited structure is employed in one embodiment.

Generally, as shown in FIG. 1, the high level logical architecture 10 includes the Query Analysis module 20 implementing functions for receiving and analyzing a user query or question. The term “user” may refer to a person or persons interacting with the system, or may refer to a computer system 22 generating a query by mechanical means, in which case the term “user query” refers to such a mechanically generated query and context 19′. A candidate answer generation module 30 is provided to implement a search for candidate answers by traversing structured, semi-structured and unstructured sources contained in primary sources (e.g., the Web, a data corpus 41) and in an Answer Source or a Knowledge Base (KB), e.g., containing collections of relations and lists extracted from primary sources. All the sources of information can be locally stored or distributed over a network, including the Internet.

The Candidate Answer generation module 30 of architecture 10 generates a plurality of output data structures containing candidate answers based upon the analysis of retrieved data. In FIG. 1, an Evidence Gathering module 50 further interfaces with the primary sources and knowledge base for concurrently analyzing the evidence based on passages having candidate answers, and scores each of the candidate answers, in one embodiment, as parallel processing operations. In one embodiment, the architecture may be employed utilizing the Common Analysis System (CAS) candidate answer structures as is described in commonly-owned, issued U.S. Pat. No. 7,139,752, the whole contents and disclosure of which is incorporated by reference as if fully set forth herein.

As depicted in FIG. 1, when the Search System 30 a is employed in the context of a QA system, the Evidence Gathering and Scoring module 50 comprises a Candidate Answer Scoring module 40 for analyzing a retrieved passage and scoring each of the candidate answers of a retrieved passage. The Answer Source Knowledge Base (KB) may comprise one or more databases of structured or semi-structured sources (pre-computed or otherwise) comprising collections of relations (e.g., Typed Lists). In an example implementation, the Answer Source knowledge base may comprise a database stored in a memory storage system, e.g., a hard drive.

An Answer Ranking module 60 may be invoked to provide functionality for ranking candidate answers and determining a response 99 returned to a user via a user's computer display interface (not shown) or a computer system 22, where the response may be an answer, or an elaboration of a prior answer, or a request for clarification in response to a question when a high quality answer to the question is not found. A machine learning implementation is further provided where the “answer ranking” module 60 includes a trained model component (not shown) produced using machine learning techniques from prior data.

The processing depicted in FIG. 1 may be local, on a server, or server cluster, within an enterprise, or alternately, may be distributed with or integral with or otherwise operate in conjunction with a public or privately available search engine in order to enhance the question answer functionality in the manner as described. Thus, the method may be provided as a computer program product comprising instructions executable by a processing device, or as a service deploying the computer program product. The architecture employs a search engine (e.g., a document retrieval system) as a part of Candidate Answer Generation module 30 which may be dedicated to searching the Internet, a publicly available database, a web-site (e.g., IMDB.com), a privately available collection of documents or, a privately available database. Databases can be stored in any storage system, including non-volatile memory storage systems, e.g., a hard drive or flash memory, and can be distributed over the network or not.

In one embodiment, when employed in a QA system, the system and method of FIG. 1 makes use of the Common Analysis System (CAS), a subsystem of the Unstructured Information Management Architecture (UIMA) that handles data exchanges between the various UIMA components, such as analysis engines and unstructured information management applications. CAS supports data modeling via a type system independent of programming language, provides data access through a powerful indexing mechanism, and provides support for creating annotations on text data, such as described in (http://www.research.ibm.com/journal/sj/433/gotz.html) incorporated by reference as if set forth herein. It should be noted that the CAS allows for multiple definitions of the linkage between a document and its annotations, as is useful for the analysis of images, video, or other non-textual modalities (as taught in the herein incorporated reference U.S. Pat. No. 7,139,752).

In one embodiment, UIMA may be provided as middleware for the effective management and interchange of unstructured information over a wide array of information sources. The architecture generally includes a search engine, data storage, analysis engines containing pipelined document annotators and various adapters. The UIMA system, method and computer program may be used to generate answers to input queries. The method includes inputting a document and operating at least one text analysis engine that comprises a plurality of coupled annotators for tokenizing document data and for identifying and annotating a particular type of semantic content. Thus it can be used to analyze a question and to extract entities as possible answers to a question from a collection of documents.

In an alternative environment, modules of FIGS. 1, 2 can be represented as functional components in GATE (General Architecture for Text Engineering) (see: http://gate.ac.uk/releases/gate-2.0alpha2-build484/doc/userguide.html). GATE employs components which are reusable software chunks with well-defined interfaces that are conceptually separate from GATE itself. All component sets are user-extensible and together are called CREOLE, a Collection of REusable Objects for Language Engineering. The GATE framework is a backplane into which CREOLE components plug. The user gives the system a list of URLs to search when it starts up, and components at those locations are loaded by the system. In one embodiment, only their configuration data is loaded to begin with; the actual classes are loaded when the user requests the instantiation of a resource. GATE components are one of three types of specialized Java Beans: 1) Resource: The top-level interface, which describes all components. What all components share in common is that they can be loaded at runtime, and that the set of components is extendable by clients. They have Features, which are represented externally to the system as “meta-data” in a format such as RDF, plain XML, or Java properties. Resources may all be Java beans in one embodiment. 2) ProcessingResource: Is a resource that is runnable, may be invoked remotely (via RMI), and lives in class files. In order to load a PR (Processing Resource) the system knows where to find the class or jar files (which will also include the metadata); 3) LanguageResource: Is a resource that consists of data, accessed via a Java abstraction layer. They live in relational databases; and, VisualResource: Is a visual Java bean, component of GUIs, including the main GATE GUI. Like PRs, these components live in .class or .jar files.

In describing the GATE processing model any resource whose primarycharacteristics are algorithmic, such as parsers, generators and so on,is modeled as a Processing Resource. A PR is a Resource that implementsthe Java Runnable interface. The GATE Visualisation Model implementsresources whose task is to display and edit other resources are modeledas Visual Resources. The Corpus Model in GATE is a Java Set whosemembers are documents. Both Corpora and Documents are types of LanguageResources (LR) with all LRs having a Feature Map (a Java Map) associatedwith them that stored attribute/value information about the resource.FeatureMaps are also used to associate arbitrary information with rangesof documents (e.g. pieces of text) via an annotation model. Documentshave a DocumentContent which is a text at present (future versions mayadd support for audiovisual content) and one or more AnnotationSetswhich are Java Sets.

Like UIMA, GATE can be used as a basis for implementing natural language dialog systems and multimodal dialog systems having a question answering system as one of the main submodules. The references incorporated herein by reference above (U.S. Pat. Nos. 6,829,603, 6,983,252, and 7,136,909) enable one skilled in the art to build such an implementation.

FIG. 4 illustrates an exemplary hardware configuration of a computing system 400 in which the present system and method may be employed. The hardware configuration preferably has at least one processor or central processing unit (CPU) 411. The CPUs 411 are interconnected via a system bus 412 to a random access memory (RAM) 414, read-only memory (ROM) 416, input/output (I/O) adapter 418 (for connecting peripheral devices such as disk units 421 and tape drives 440 to the bus 412), user interface adapter 422 (for connecting a keyboard 424, mouse 426, speaker 428, microphone 432, and/or other user interface device to the bus 412), a communication adapter 434 for connecting the system 400 to a data processing network, the Internet, an Intranet, a local area network (LAN), etc., and a display adapter 436 for connecting the bus 412 to a display device 438 and/or printer 439 (e.g., a digital printer or the like).

As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.

Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with a system, apparatus, or device running an instruction.

A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with a system, apparatus, or device running an instruction.

Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may run entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).

Thus, in one embodiment, the system and method for efficient passage retrieval may be performed with data structures native to various programming languages such as Java and C++.
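
By way of non-limiting illustration, the following Java sketch shows how native data structures might support metadata-based passage retrieval: a dictionary maps metadata terms to document identifications, query terms are matched against the dictionary to form a subcorpus, and only that subcorpus is searched for passages. The class name MetadataPassageRetrieval, its method names, the whitespace-and-punctuation tokenizer, and the term-overlap passage test are all assumptions made for this example; they are simplifications and not the implementation of any particular embodiment.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative only: class, method, and field names are assumptions.
final class MetadataPassageRetrieval {

    // Metadata dictionary: metadata term -> IDs of the documents carrying it.
    private final Map<String, Set<String>> dictionary = new HashMap<>();
    // Corpus: document ID -> passages (assumed here to be pre-split sentences).
    private final Map<String, List<String>> corpus = new HashMap<>();

    /** Registers a document with its passages and metadata (e.g. title, user tags). */
    void addDocument(String docId, List<String> passages, List<String> metadata) {
        corpus.put(docId, passages);
        for (String entry : metadata) {
            // Simplification: each word of a metadata entry is indexed separately.
            for (String term : queryTerms(entry)) {
                dictionary.computeIfAbsent(term, k -> new LinkedHashSet<>()).add(docId);
            }
        }
    }

    /** Naive query analysis: lower-case, then split on non-letter, non-digit runs. */
    static List<String> queryTerms(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    /** Dictionary matching: maps matched metadata to document IDs (the subcorpus). */
    Set<String> subcorpus(List<String> terms) {
        Set<String> ids = new LinkedHashSet<>();
        for (String t : terms) {
            ids.addAll(dictionary.getOrDefault(t, Collections.<String>emptySet()));
        }
        return ids;
    }

    /** Searches only the subcorpus; returns passages sharing a term with the query. */
    List<String> retrieve(String query) {
        List<String> terms = queryTerms(query);
        List<String> hits = new ArrayList<>();
        for (String id : subcorpus(terms)) {
            for (String passage : corpus.get(id)) {
                if (!Collections.disjoint(queryTerms(passage), terms)) {
                    hits.add(passage);
                }
            }
        }
        return hits;
    }
}
```

For example, if one document titled "Abraham Lincoln" and another titled "Kentucky" are registered, the query "Where was Lincoln born?" matches only the first title, so passages of the second document are never scanned even where they happen to share a word with the query.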

Aspects of the present invention are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which run via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which run on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more operable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be run substantially concurrently, or the blocks may sometimes be run in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The embodiments described above are illustrative examples and it should not be construed that the present invention is limited to these particular embodiments. Thus, various changes and modifications may be effected by one skilled in the art without departing from the spirit or scope of the invention as defined in the appended claims.

1. A computer program product for efficiently retrieving relevant passages to questions based on a corpus of data, the computer program device comprising a non-transitory storage medium readable by a processing circuit and storing instructions run by the processing circuit for performing a method, the method comprising: receiving an input query; performing a query context analysis upon said input query to obtain searchable query terms; matching metadata associated with one or more documents against said query terms; mapping matched document metadata to corresponding one or more documents; identifying corresponding matched documents to form a subcorpus of documents; and conducting a search in said data subcorpus using said searchable query terms to obtain one or more passages relevant to the input query from said identified documents.
2. The computer program product of claim 1, wherein the document metadata includes one or more of: a title of the documents, one or more user tags, one or more automatically identified document labels.
3. The computer program product of claim 1, wherein prior to matching of metadata associated with one or more documents against said query terms: extracting document metadata from one or more documents of a corpus of documents; providing said extracted document metadata as a dictionary in a storage device, each document metadata stored in said dictionary being associated with one or more corresponding document identifications.
4. The computer program product of claim 3, wherein said matching of metadata against said query terms comprises: performing, by said processor device, dictionary matching.
5. The computer program product of claim 1, wherein said data corpus comprising document metadata information includes variations of metadata including one or more of: singular and plural forms of metadata terms, and synonyms for metadata terms.
6. The computer program product of claim 2, wherein obtaining searchable query terms from said input query comprises parsing, by said processor device, said input query to obtain terms matching document metadata.
7. The computer program product of claim 2, wherein said identifying corresponding matched documents to form a subcorpus of documents includes tagging or flagging each matched metadata documents in said corpus of documents.
8. The computer program product of claim 2, further comprising: extracting said tagged or flagged identified corresponding matched documents to form said subcorpus of documents.
9. A system for efficiently retrieving relevant passages to questions based on a corpus of data comprising: a memory storage device; a processor device in communication with the memory device that performs a method comprising: receiving an input query; performing a query context analysis upon said input query to obtain searchable query terms; matching metadata associated with one or more documents against said query terms; mapping matched document metadata to corresponding one or more documents; identifying corresponding matched documents to form a subcorpus of documents; and conducting a search in said data subcorpus using said searchable query terms to obtain one or more passages relevant to the input query from said identified documents.
10. The system of claim 9, wherein the document metadata includes one or more of: a title of the documents, one or more user tags, one or more automatically identified document labels.
11. The system of claim 10, wherein prior to matching of metadata associated with one or more documents against said query terms: extracting document metadata from one or more documents of a corpus of documents; providing said extracted document metadata as a dictionary in a storage device, each document metadata stored in said dictionary being associated with a corresponding document identification, wherein said matching of metadata against said query terms comprises performing a dictionary matching.
12. A computer program product for efficiently retrieving relevant passages to questions based on a corpus of data, the computer program device comprising a storage medium readable by a processing circuit and storing instructions run by the processing circuit for performing a method, the method comprising: receiving, at a processor device, an input query; performing, at said processor device, a query context analysis upon said input query to obtain searchable query terms; accessing a dictionary of document metadata obtained from one or more documents of the data corpus, each stored document metadata being associated with a corresponding document identification (ID); performing, by said processor device, a dictionary matching of said metadata associated with one or more documents against said query terms; mapping matched document metadata to corresponding one or more document IDs; identifying corresponding matched documents to form a subcorpus of documents; and conducting a search in said subcorpus using said searchable query terms to obtain one or more passages relevant to the input query from said identified documents.
13. The computer program product of claim 12, wherein the document metadata includes one or more of: a title of the documents, one or more user tags, one or more automatically identified document labels.